Ranking Scientific Publications Using a Simple Model of Network Traffic 
To account for strong aging characteristics of citation networks, we modify Google's PageRank 
algorithm by initially distributing random surfers exponentially with age, in favor of more recent 
publications. The output of this algorithm, which we call CiteRank, is interpreted as approximate 
traffic to individual publications in a simple model of how researchers find new information. We 
develop an analytical understanding of traffic flow in terms of an RPA-like model and optimize 
parameters of our algorithm to achieve the best performance. The results are compared for two 
rather different citation networks: all American Physical Society publications and the set of 
high-energy physics theory (hep-th) preprints. Despite major differences between these two networks, we 
find that their optimal parameters for the CiteRank algorithm are remarkably similar. 



Due to their rapid growth and large size, many in- 
formation networks have become untenable to navigate 
without some sort of ranking scheme. This is particu- 
larly evident in the example of the World Wide Web, a 
network of pages connected by hyperlinks. A successful 
solution to the problem of ranking the Web is Google's 
PageRank algorithm [1]. Another class of information 
networks that could benefit from such a ranking method 
are citation networks. These networks are comprised of 
scientific publications connected by citation links. 

Current methods of ranking publications based on the 
total number of citations received are rather crude. They 
are too "democratic" in treating all citations as equal 
and ignoring differences in importance of citing papers. 
One of the advantages of Google's PageRank algorithm 
is that it implicitly accounts for the importance of the 
citing article in a self-consistent fashion. Authors of [2] 
proposed using the PageRank algorithm to improve the 
formula used to calculate the impact factor of scientific 
journals. In [3] some of us directly applied this algorithm 
to individual papers published in all American Physical 
Society journals. This allowed us to discover a set of 
highly influential papers ("scientific gems") that would 
be undervalued based on just their number of citations. 
However, there exist significant differences between the 
World Wide Web and citation networks that suggest a 
modification of the original PageRank algorithm. The 
most important difference is that, unlike hyperlinks, ci- 
tations cannot be updated after publication. This makes 
aging effects [4, 5] in citation networks much more pro- 
nounced than in the WWW. The other consequence is 
the inherent time-arrow present in the topology of cita- 
tion networks, due to the constraint that a paper may 
only cite earlier works. This significantly alters the spec- 
tral properties of the adjacency matrix which lie at the 
heart of the PageRank algorithm. In particular, the absence 
of directed loops means that the adjacency matrix 
can have only zero eigenvalues. 

The success of the PageRank algorithm can be at- 
tributed, in part, to its ability to capture the behavior 
of people randomly browsing the network of web pages. 
Indeed, the PageRank of a given web page can be inter- 
preted as the predicted traffic (quantified e.g., by the rate 
of downloads) for that page if every WWW user follows a 
random path of (on average) $1/\alpha$ hyperlinks starting from 
a randomly selected webpage. The assumption that a 
typical web-surfer starts at a randomly selected webpage 
might not be completely unreasonable for the WWW, 
but it needs to be modified for citation networks. As all 
of us know, researchers typically start "surfing" scientific 
publications from a rather recent publication that caught 
their attention on a daily update of a preprint archive or 
a recent volume of a journal. Thus a more realistic model 
for the traffic along the citation network should take into 
account that researchers preferentially start their quests 
from recent papers and progressively get to older and 
older papers with every step. 

In this work we introduce the CiteRank algorithm, an 
adaptation of the PageRank algorithm to citation net- 
works. Our algorithm simulates the dynamics of a large 
number of researchers looking for new information. Ev- 
ery researcher, independent of one another, is assumed 
to start his/her search from a recent paper or review and 
to subsequently follow a chain of citations until satisfied. 
Explicitly, we define the following two-parameter CiteRank 
model of such a process, allowing one to estimate 
the traffic $T_i(\tau_{dir}, \alpha)$ to a given paper $i$. A recent paper 
is selected randomly from the whole population with a 
probability that is exponentially discounted according to 
the age of the paper, with a characteristic decay time of 
$\tau_{dir}$. At every step of the path, with probability $\alpha$ the 
researcher is satisfied and halts his/her line of inquiry. 
With probability $(1 - \alpha)$ a random citation to an adjacent 
paper is followed. The predicted traffic, $T_i(\tau_{dir}, \alpha)$, 
to a paper is proportional to the rate at which it is visited 
if a large number of researchers independently follow 
such a simple-minded process. 
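
The process described above can be illustrated with a small Monte Carlo simulation. The following is a minimal sketch, not the authors' implementation: the function name `simulate_traffic`, the adjacency-list input format, and the three-paper toy network are all assumptions made for illustration. A walker that reaches a paper with no references is assumed to stop.

```python
import math
import random


def simulate_traffic(cites, ages, tau_dir, alpha, n_walkers=100_000, seed=0):
    """Estimate traffic by simulating independent researchers (hypothetical helper).

    cites[j] lists the papers cited by paper j; ages[j] is its age in years.
    Returns the mean number of visits per researcher for every paper.
    """
    rng = random.Random(seed)
    n = len(ages)
    # Starting points are drawn with weights that decay exponentially with age.
    weights = [math.exp(-age / tau_dir) for age in ages]
    starts = rng.choices(range(n), weights=weights, k=n_walkers)
    visits = [0] * n
    for paper in starts:
        visits[paper] += 1
        # With probability alpha the researcher stops; otherwise a random
        # citation is followed. A paper without references ends the walk.
        while cites[paper] and rng.random() >= alpha:
            paper = rng.choice(cites[paper])
            visits[paper] += 1
    return [v / n_walkers for v in visits]


# Toy network (illustrative only): paper 2 is the newest and cites 0 and 1.
toy_cites = [[], [0], [0, 1]]
toy_ages = [2.0, 1.0, 0.0]
toy_traffic = simulate_traffic(toy_cites, toy_ages, tau_dir=1.0, alpha=0.5)
```

In this toy example the newest paper receives only direct traffic, while the oldest paper collects most of its traffic through citation links even though it is rarely chosen as a starting point.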

While we interpret the output of the CiteRank algo- 
rithm as the traffic, its utility ultimately lies in the ability 
to successfully rank publications. High CiteRank traffic 
to a publication denotes its high relevance in the context 
of currently popular research directions, while the PageR- 
ank number is more of a "lifetime achievement award" [3] . 
It is fruitful to compare the CiteRank traffic to a paper, 
$T_i$, with the more traditional method of ranking publications, 
the number of citations received. Indeed, the two 
are highly correlated; a result easily understood on the 
basis that the larger the number of citations a paper has, 
the more likely it will be visited by a researcher via one 
of the incoming links. 

However, the more refined CiteRank algorithm surpasses 
the conventional ranking by number of citations 
in its characterization of relevance on two counts. Like 
the original PageRank algorithm [1][2], in CiteRank the 
popularity of papers is calculated in a self-consistent fashion: 
the effect of a citation from a more popular paper 
is greater than that of a citation from a less popular one. A citation from 
a paper that is "highly visible" will contribute more to 
the visibility of the cited paper. Furthermore, the age of 
a citing paper is intrinsically accounted for. The effect 
of a recent citation to a paper is greater than that of an 
older citation to the same paper. New citations indicate 
the relevancy of a paper in the context of current lines of 
research. 

An algorithmic description of the aforementioned 
model is as follows. The transfer matrix 
associated with the citation network is $W_{ij} = 1/k_j^{out}$ 
if $j$ cites $i$ and $0$ otherwise, where $k_j^{out}$ is the out-degree 
of the $j$th paper. Let $p_i$, the probability of initially selecting 
the $i$th paper in a citation network, be given by 
$p_i = e^{-age_i/\tau_{dir}}$. The probability that the researcher will 
encounter a paper by initial selection alone is given by 
$\vec{p}$. Similarly, the probability of encountering the paper 
after following one link is $(1 - \alpha) W \cdot \vec{p}$. The CiteRank 
traffic of the paper is then defined as the probability of 
encountering it via paths of any length: 

$$\vec{T} = I \cdot \vec{p} + (1 - \alpha) W \cdot \vec{p} + (1 - \alpha)^2 W^2 \cdot \vec{p} + \cdots \qquad (1)$$ 

Practically, we calculate the CiteRank traffic on all papers 
in our dataset by taking successive terms in the 
above expansion to sufficient convergence ($< 10^{-10}$ of 
the average value). 
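
The term-by-term summation of Eq. 1 can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the results reported here: the function name `citerank`, the dense matrix, the normalization of the starting probabilities to unit sum, and the exact stopping rule (largest entry of the newest term below `tol` times the mean traffic) are all choices made for the example.

```python
import numpy as np


def citerank(cites, ages, tau_dir, alpha, tol=1e-10, max_terms=10_000):
    """Sum T = p + (1-a) W p + (1-a)^2 W^2 p + ... term by term (sketch).

    cites[j] lists the papers cited by paper j; ages[j] is its age in years.
    """
    n = len(ages)
    # Dense transfer matrix for clarity; a real network of this size would
    # need a sparse representation.
    W = np.zeros((n, n))
    for j, refs in enumerate(cites):
        for i in refs:
            W[i, j] = 1.0 / len(refs)  # W_ij = 1/k_j^out if j cites i
    p = np.exp(-np.asarray(ages, dtype=float) / tau_dir)
    p /= p.sum()  # assumption: starting probabilities normalized to one
    term = p.copy()
    traffic = p.copy()
    for _ in range(max_terms):
        term = (1.0 - alpha) * (W @ term)
        traffic += term
        # Assumed stopping rule: newest term negligible relative to the mean.
        if term.max() < tol * traffic.mean():
            break
    return traffic


# Toy network (illustrative only): paper 2 is the newest and cites 0 and 1.
toy_traffic = citerank([[], [0], [0, 1]], [2.0, 1.0, 0.0], tau_dir=1.0, alpha=0.5)
```

Because a citation network contains no directed loops, the terms of the series vanish exactly once the power of $W$ exceeds the length of the longest citation chain, so the loop always terminates on such data.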

In order to assess the viability of this ranking scheme 
and to select optimal parameters $(\tau_{dir}, \alpha)$, we need a 
quantitative measure of its performance on real cita- 
tion networks. Two real citation networks are evaluated. 
Hep-th: An archive snapshot of the "high energy physics 
theory" archive from April 2003 (preprints ranging from 
1992 to 2003). This dataset, containing around 28,000 
papers and 350,000 citation links, was downloaded from 
[6]. Physrev: Citation data between journals published 
by the American Physical Society [7]. This dataset contains 
around 380,000 papers and 3,100,000 citation links 
ranging from 1893 to 2003. 

Of course, evaluating the performance of any ranking 
scheme is a delicate, but often necessary, matter. One 
way to select the best performing $\alpha$ and $\tau_{dir}$ is to optimize 
the correlation between the predicted traffic, $T_i(\tau_{dir}, \alpha)$, 
and the actual traffic (e.g., downloads). Unfortunately, 
the actual traffic data for scientific publications are not 
readily available for these networks. However, it is rea- 
sonable to assume that traffic to a paper is positively 
correlated with the number of new citations it accrues 
over a recent time interval, $\Delta k_{in}$. For lack of better intuition 
we first assume a linear relationship between actual 
traffic and number of recent citations accrued. This 
corresponds to a simple-minded scenario in which every 
researcher downloading a paper will, with a small proba- 
bility, add it to the citation list of the manuscript he/she 
is writing [8] . In order to compare CiteRank with actual 
citation accrual, we constructed an historical snapshot of 
the networks. In both cases, the most recent 10 percent 
of papers are pruned from the network. The CiteRank 
traffic, $T_i$, of the remaining 90 percent of the papers is 
then evaluated and correlated with their actual accrual 
of new citations, $\Delta k_{in}$, originating at the most recent 10 
percent of papers. It is important to note the qualitative 
and quantitative differences between the two citation networks 
considered. 

FIG. 1: The Pearson (linear) correlation coefficient between 
the number of recent citations accrued ($\Delta k_{in}$) and CiteRank 
traffic ($T_i$) is calculated over the parameter space of the CiteRank 
model for the hep-th (A) and physrev (B) networks. 
Both networks exhibit peaks in correlation coefficient in the 
$\alpha$-$\tau_{dir}$ plane. The highest correlation is achieved for $\alpha = 0.48$, 
$\tau_{dir} = 1$ year in the hep-th network and $\alpha = 0.50$, $\tau_{dir} = 2.6$ 
years in the physrev network. 

The Physical Review citation network 
(physrev) is comprised of a large number (~400,000) of 
peer-reviewed publications acquired over a period close 
to a hundred years. The high-energy physics archive 
citation network (hep-th) is comprised completely of a 
much smaller number (~28,000) of electronically submitted 
publication preprints, with no associated form of 
peer review. Despite these significant differences in the 
nature of the networks considered, the general features of 
their correlation contours are strikingly similar. In 
both cases, a single sharp peak in correlation is evident 
for particular values of the parameters. The optimal 
parameter values for the two networks are: 

hep-th: $\alpha = 0.48$, $\tau_{dir} = 1$ year 

physrev: $\alpha = 0.50$, $\tau_{dir} = 2.6$ years 

Remarkably, the value of $\alpha$ is nearly the same for the two 
rather different networks considered here and is in agreement 
with that proposed in [3] on purely empirical 
grounds. The difference in the optimal parameter $\tau_{dir}$ for 
these networks is in agreement with the common-sense 
expectation of faster response time (and hence faster ag- 
ing of citations) in preprint archives compared to peer- 
reviewed publications. Another feature of Fig. 1 is 
that, in both networks, large values of the correlation 
coefficient are concentrated along a diagonally-positioned 
ridge. In other words, the best choice of $\alpha$ for a given $\tau_{dir}$ 
seems to rise linearly with $\tau_{dir}$, a behavior that will be 
revisited later in this text. The resultant CiteRank traffic 
and corresponding ranking for the two citation networks 
can be accessed here [9] . 
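
The snapshot evaluation described above (prune the newest 10 percent of papers, then correlate model traffic on the remainder with the citations they receive from the pruned papers) can be sketched as follows. The helper names `recent_citation_counts` and `pearson`, the input format, and the ten-paper toy network are assumptions for illustration; the traffic vector would come from a CiteRank computation on the pruned network, which is not repeated here.

```python
import numpy as np


def recent_citation_counts(cites, pub_order, recent_fraction=0.1):
    """Count citations that old papers receive from the most recent papers.

    cites[j] lists the papers cited by paper j; pub_order lists paper
    indices from oldest to newest (hypothetical input format).
    Returns the retained (old) papers and their Delta k_in values.
    """
    n = len(pub_order)
    n_old = n - int(round(recent_fraction * n))
    old = list(pub_order[:n_old])
    recent = pub_order[n_old:]
    position = {paper: k for k, paper in enumerate(old)}
    delta_k_in = np.zeros(n_old)
    for j in recent:
        for i in cites[j]:
            if i in position:  # only citations landing on retained papers
                delta_k_in[position[i]] += 1
    return old, delta_k_in


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy example (illustrative only): ten papers, the newest one cites 0 and 3.
toy_cites = {k: [] for k in range(10)}
toy_cites[5] = [0]
toy_cites[9] = [0, 3]
old_papers, delta_k = recent_citation_counts(toy_cites, list(range(10)))
```

Scanning a grid of $(\alpha, \tau_{dir})$ values and recording `pearson(traffic, delta_k)` at each point would give a correlation surface of the kind shown in Fig. 1, under the assumptions stated above.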

While the correlation contour plots shown in Fig. 1 
are a promising indication that the CiteRank model of 
traffic provides a good zero-order approximation to the 
actual traffic along a citation network, they are to some 
extent predicated on the assumption of a linear relationship 
between actual traffic and $\Delta k_{in}$. One might readily 
ask how this model fares in the absence of such an as- 
sumption. While the assumption of a linear relationship 
may be unreasonable, a positive, monotonic relationship 
between these quantities is certainly expected. There is a 
statistical correlation method precisely adapted for such 
a situation, namely, the Spearman rank correlation. Under 
this relaxed correlation measure, only the ranks of $T_i$ 
are correlated with the ranks of $\Delta k_{in}$. Numerical changes 
in $T_i$ that do not lead to reordering have no effect on the 
value of the rank correlation coefficient. Another ratio- 
nale for using rank correlations is that our ultimate goal 
is ranking publications, not modeling the traffic. Thus, 
we are currently not interested in the individual $T_i$'s, but 
only in their relative values. Spearman correlation con- 
tour plots are constructed for both networks and shown 
in Fig. 2. The optimal values for both networks are: 

hep-th: $\alpha = 0.31$, $\tau_{dir} = 1.6$ years 

physrev: $\alpha = 0.55$, $\tau_{dir} = 8$ years 

These results roughly confirm the prediction of $\alpha \sim 0.5$ 
from Fig. 1; however, there is a more appreciable discrepancy 
in $\tau_{dir}$ between linear and rank correlation for both 
networks. 
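
The difference between the two correlation measures can be made concrete with a short sketch. The helpers `average_ranks` and `spearman` below are illustrative names, and the numbers are invented toy data; the Spearman coefficient is computed here as the Pearson coefficient of the ranks, with tied values sharing their average rank (ties matter in practice because many papers accrue zero new citations). Library routines such as `scipy.stats.spearmanr` offer the same measure.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    start = 0
    while start < len(values):
        stop = start
        # Extend the block while the next sorted value is tied with this one.
        while stop + 1 < len(values) and sorted_vals[stop + 1] == sorted_vals[start]:
            stop += 1
        ranks[order[start:stop + 1]] = 0.5 * (start + stop) + 1.0
        start = stop + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy data (illustrative only): traffic and citations agree in ordering
# but are not linearly related.
toy_traffic = np.array([0.1, 0.4, 0.2, 0.9])
toy_citations = np.array([0.0, 5.0, 1.0, 30.0])
rank_corr = spearman(toy_traffic, toy_citations)
linear_corr = float(np.corrcoef(toy_traffic, toy_citations)[0, 1])
```

Applying any increasing function to the traffic values, for example `np.exp(10 * toy_traffic)`, leaves `rank_corr` unchanged while the linear coefficient shifts, which is the property that motivates the rank-based measure.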

FIG. 2: The Spearman rank correlation coefficient between 
recent citations accrued ($\Delta k_{in}$) and CiteRank traffic ($T_i$) for 
the hep-th (A) and physrev (B) networks. Both networks exhibit 
similar behavior. There are more extended regions of 
good correlation relative to the linear correlation contours of 
Fig. 1. This broadening is expected as a consequence of the 
more relaxed correlation measure. The highest rank correlation 
occurs for $\alpha = 0.31$, $\tau_{dir} = 1.6$ years in the hep-th 
network and $\alpha = 0.55$, $\tau_{dir} = 8$ years in the physrev network. 

In both panels of Fig. 1, over a broad range of parameters, 
the optimal value of $\alpha(\tau_{dir})$ for a given value of $\tau_{dir}$ 
is positively correlated with $\tau_{dir}$. This is an indication 
that these two parameters are entangled. In fact, this is 
to be expected as it is some admixture of the two param- 
eters which leads to the exposure of a given paper to the 
researcher. An intuitive picture of this entanglement can 
be understood in terms of the penetration depth, which is 
a measure of how far back in time a random surfer follow- 
ing rules of the CiteRank algorithm is likely to get. The 
penetration depth is affected by both $\tau_{dir}$, the average 
age of the initial paper at which he/she started following 
the chain of citations, and $1/\alpha$, the mean number of 
steps on this chain of citations. For small $\tau_{dir}$ and large 
$\alpha$, the penetration depth is small, implying that only very 
recent papers receive traffic. On the other hand, for large 
$\tau_{dir}$ and small $\alpha$, the penetration depth is very large, 
indicating that most of the traffic is directed towards older 
papers. 

To better understand how $\alpha$ and $\tau_{dir}$ influence the age 
distribution of CiteRank traffic, we performed the following 
quantitative analysis. Let $T_{tot}(t)$ denote the total 
CiteRank model traffic to papers written exactly $t$ 
years ago. As described by Eq. 1, two distinct processes 
contribute to $T_{tot}(t)$. The first is the "direct" traffic 
$T_{dir}(t)$ due to the initial selection of papers in this 
age group, which is proportional to $\exp(-t/\tau_{dir})$ [11]. 
The second is the "indirect" traffic $T_{ind}(t)$ arriving via 
one of the incoming citation links. The latter is given by 
$T_{ind}(t) = (1 - \alpha) \int_0^t T_{tot}(t') P_c(t', t)\, dt'$, where $P_c(t', t)$ is 
the fraction of citations originating from papers of age 
$t'$ that cite papers of age $t$ (citing papers are younger, so $t' \le t$). 
It should be noted that 
$P_c(t', t)$ is an empirical distribution and, as such, is a 
measured property of the citation network under consideration. 
According to [5] and our own findings, $P_c(t', t)$ 
is reasonably well approximated by the exponential form 
$\sim \exp(-(t - t')/\tau_c)$. Taking the Fourier transform of the 
equation $T_{tot}(t) = T_{dir}(t) + T_{ind}(t)$, we have 

$$T_{tot}(\omega) = T_{dir}(\omega) + (1 - \alpha) T_{tot}(\omega) P_c(\omega). \qquad (2)$$ 

This equation is similar, in spirit, to the well-known random 
phase approximation [10]. Solving Eq. 2 for $T_{tot}(\omega)$ 
and taking the inverse Fourier transform yields 

$$T_{tot}(t) \sim (\tau_c - \tau_{dir}) \exp(-t/\tau_{dir}) + (1 - \alpha) \tau_{dir} \exp(-\alpha t/\tau_c). \qquad (3)$$ 
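
Eq. 3 can be checked numerically. The sketch below rests on explicit assumptions that are not spelled out in the text: the citation kernel is taken to be the normalized exponential $(1/\tau_c)\exp(-s/\tau_c)$ of the age gap $s \ge 0$ between citing and cited paper, the direct traffic is $\exp(-t/\tau_{dir})$, and the closed form is normalized so that $T_{tot}(0) = 1$. The function names are hypothetical. The integral equation is solved step by step with the trapezoid rule and compared with the closed form.

```python
import numpy as np


def traffic_closed_form(t, tau_dir, tau_c, alpha):
    """Eq. 3, normalized so that the traffic at age zero equals one."""
    direct = (tau_c - tau_dir) * np.exp(-t / tau_dir)
    indirect = (1.0 - alpha) * tau_dir * np.exp(-alpha * t / tau_c)
    return (direct + indirect) / (tau_c - alpha * tau_dir)


def traffic_numeric(t, tau_dir, tau_c, alpha):
    """Solve T_tot = T_dir + (1 - alpha) * (T_tot convolved with P_c).

    t must be a uniform grid starting at zero. The kernel is an assumed
    normalized exponential of the age gap between citing and cited paper.
    """
    h = t[1] - t[0]
    kernel = np.exp(-t / tau_c) / tau_c
    direct = np.exp(-t / tau_dir)
    total = np.empty_like(t)
    total[0] = direct[0]
    for n in range(1, len(t)):
        # Trapezoid weights for citing ages t_0 .. t_{n-1}; the end point
        # t_n is moved to the left-hand side and solved for explicitly.
        weights = kernel[n:0:-1].copy()
        weights[0] *= 0.5
        past = h * np.dot(weights, total[:n])
        implicit = 1.0 - (1.0 - alpha) * 0.5 * h * kernel[0]
        total[n] = (direct[n] + (1.0 - alpha) * past) / implicit
    return total


ages = np.linspace(0.0, 40.0, 2001)
exact = traffic_closed_form(ages, tau_dir=2.6, tau_c=8.0, alpha=0.5)
approx = traffic_numeric(ages, tau_dir=2.6, tau_c=8.0, alpha=0.5)
max_error = float(np.max(np.abs(exact - approx)))
```

For the parameter values quoted in the text ($\tau_c = 8$ years, $\alpha = 0.5$), the slow exponential in Eq. 3 has a decay time of $\tau_c/\alpha = 16$ years, while the fast one decays with $\tau_{dir}$.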

Thus, the traffic arriving at the subset of papers of age t is 
given by the superposition of two exponential functions. 
We are now in a position to better understand what determines 
the optimal values of $\alpha$ and $\tau_{dir}$. Open circles 
in Fig. 3 show the age distribution of the number of recently 
acquired citations, $\Delta k_{in}$, for papers in the physrev 
dataset. The approximate CiteRank traffic, given by Eq. 
3, is also displayed. It is calculated using the empirically 
determined value $\tau_c = 8$ years, optimal $\tau_{dir} = 2.6$ years 
and three values of $\alpha$: 0.2, 0.5 and 0.9. As one would 
expect, the profile of $\Delta k_{in}$ vs $t$ best agrees with the CiteRank 
plot for the optimal value $\alpha = 0.5$ [13]. Fig. 3 also 
provides some clues to the positive correlation between 
near-optimal choices of $\alpha$ and $\tau_{dir}$, visible as diagonal 
"ridges" in Fig. 1A and B. Indeed, if the value of $\alpha$ is 
chosen to be large, the contribution from the second term 
is diminished; the use of a larger value of $\tau_{dir}$ could partially 
compensate for the loss of CiteRank traffic to older 
papers, and would thus be in reasonably good agreement 
with the $\Delta k_{in}$ data. 

Another encouraging observation is that, like Eq. 3, 
the age distribution of recently acquired citations shown 
in Fig. 3 has two regimes characterized by two different 
decay constants of about 5 and 16 years, with a crossover 
point around t = 15 years. Our interpretation of this fact 
is that papers are found and cited via two distinct mech- 
anisms: researchers can either find a paper directly or 
by following citation links from earlier papers. For each 
of these mechanisms, the probability that a given paper 
is found decays with its age but the characteristic de- 
cay time for the direct discovery is shorter. While very 
recent papers, especially the ones altogether lacking citations, 
are for the most part discovered directly, older 
papers are mostly discovered by following citation links. 

FIG. 3: The age distribution of newly accrued citations $\Delta k_{in}$ 
(blue) for the physrev network. Theoretical predictions [3] for 
the CiteRank traffic are calculated for the optimal $\tau_{dir} = 2.6$ years 
and three values of $\alpha$: 0.2 (dot-dashed line), 0.5 (thick solid 
line), and 0.9 (dashed line). In agreement with Fig. 1, the 
optimal value, $\alpha = 0.5$, provides the best agreement with 
$\Delta k_{in}$. All curves are normalized so that the sum of all data 
points is equal to 1. 

The optimal values of $\alpha$ in the two very different citation 
networks considered are remarkably close to each 
other. In both cases it appears that, on average, the 
length of chains of citations followed by a typical researcher 
is close to $1/\alpha \sim 2$. Since this chain includes 
the original starting point, the length of around 2 means 
that the average cited paper is just one link away from 
the starting point. This raises the disconcerting possibil- 
ity that many citations may be copied directly from the 
initially discovered reference. Such citation copying was 
recently proven to be a very common scenario [12]. 